
Designing an efficient memory hierarchy for multi-core architectures is a critical and challenging
task. As the number of cores on a chip increase with technology scaling, the demand on the on-chip
memory would increase significantly, further worsening the memory wall problem~\cite{BurgerGK96}. The
memory wall problem is critical both from the performance (memory density) and power perspectives.
Thus, novel technology, circuit, and architectural techniques are currently being explored to address
the memory wall problem for many-core systems.


% Dr. Yuan Changes to first two lines.
%Spin-Transfer Torque RAM (STT-RAM) is a promising memory technology that has the potential to replace the conventional %on-chip SRAM
%caches,  because of its higher density, competitive read times and lower leakage power consumption
%compared to SRAM.



Spin-Transfer Torque RAM (STT-RAM) is a promising memory technology that delivers on many aspects,
desirable of a universal memory. It has the potential to replace the conventional on-chip SRAM
caches because of its higher density, competitive read times, and lower leakage power consumption
compared to static-RAM (SRAM). However, the latency and energy overhead associated with the write operations are the key drawbacks
of this technology for providing competitive or better performance compared to the SRAM-based cache
hierarchy. Consequently, recent efforts have focused on masking the effects of high write latencies
and write energy at the architectural level~\cite{mram-energy-reduction,gsun-hpca}. In contrast to
these architectural approaches, a recent  work explored the {\it feasibility} of relaxing STT-RAM
data retention times to reduce both write latencies and write energy~\cite{STTRAM:HPCA11}. This
adaptable feature of tuning the data retention time can be exploited in several dimensions. The focus
of this paper is to tune this data retention time to closely match the required refresh time of the Last
Level Cache (LLC) blocks to achieve significant performance and energy gains. In this context, the paper
addresses several design issues such as how to decide an appropriate retention time for the LLC,
what the relationship between retention time and write latency is, and how we
architect the cache hierarchy with a volatile STT-RAM.

The non-volatile nature and non-destructive read ability of STT-RAM provides a key difference with
regard to traditional on-chip cache design with SRAM technology. However, as our analysis will show,
for many applications, it is sufficient if the data stored in the last level of a cache hierarchy
remains valid for a few tens of $ms$ in contrast to {\it $\mu$s} for the L1 cache~\cite{3t1d-cache}.
Consequently, the duration of data retention in STT-RAM
is an obvious candidate for device optimization for the cache design. We, therefore, conduct an
application-driven study to analyze the inter-write times (refresh times) of the L2 cache blocks to decide a suitable
data retention time. Although lifetime analysis of cache lines has been conducted earlier to
improve performance and reduce power consumption~\cite{cache-decay-2001,3t1d-cache}, we revisit this
topic with a different intention - correlating STT-RAM data-retention time to cache lifetime. An
extensive analysis of PARSEC 2.1~\cite{bienia11benchmarking} and
SPEC2K6~\cite{spec} benchmarks using the M5 simulator~\cite{M5} indicates that
the average inter-write times for most of the L2 cache blocks is close to $10ms$, and thus, we advocate
our STT-RAM design with this retention time.

The first effort to vary the retention time of STT-RAM cells to facilitate the design of STT-RAM-based
cache hierarchy was proposed in ~\cite{STTRAM:HPCA11}.
In this work, the authors varied the MTJ cell size from 32$F^2$ to 10$F^2$ to reduce
the retention time to {\it $\mu$s} range. Since, the state-of-the-art MTJ design is in the range of
2-3$F^2$~\cite{STTRAM:Grandis11}, there is not much room for further MTJ planar area reduction. 
Hence, we focus on varying the material property and the thickness of MTJ to reduce the retention time.
Also, we do not reduce the MTJ thickness beyond $2nm$ (which can lead to {\it $\mu$s} retention time) 
to avoid reliability problems. We conduct a detailed device level analysis of the
STT-RAM cells to analyze the write current versus write pulse width tradeoffs,
cell area analysis, and retention time stability analysis to capture the
relationship between area, read/write latency and leakage power as a function of the retention time.
Our observations demonstrate that the retention times in the range of $ms$ are more practical compared to the retention time in the {\it $\mu$s} range as described in~\cite{STTRAM:HPCA11}.
Consequently, using this $ms$ range retention time model, we propose effective
architectural techniques to avoid any data loss due to the volatile STT-RAM-based cache hierarchy.

%Our observations demonstrate that the retention times in the range of
%Our observations, in contrast to the results reported in~\cite{STTRAM:HPCA11} indicate that retention
%times in the range of milliseconds ({\it ms}) are probably more achievable than in the microseconds
%({\it $\mu$s}) range. Thus, using this {\it ms} range retention time model, we propose effective
%architectural techniques to avoid any data loss due the volatile STT-RAM-based cache hierarchy.

A key challenge in determining a suitable data-retention time for the STT-RAM is to balance the
reduced write latency of STT-RAM cells with lower retention time against the overhead for data refresh or
write-back of cache lines with longer lifetimes. In this paper, we compare 3 different STT-RAM based
cache designs: (1) STT-RAM cache without retention time relaxation (10+ years of data retention
time); (2) STT-RAM cache with retention time of $1sec$, which is long enough for the inter-write time of
majority of the cache lines, and therefore, no refreshing overhead is incurred;  and (3) STT-RAM cache with
retention time of $10ms$, which is a more aggressive design with better performance/energy gain, but a
data refreshing technique is needed for correct operations, since cache lines that have inter-write times
exceeding $10ms$ are likely to lose data. Thus, we propose simple extensions to the L2 cache design
for avoiding any data loss. This includes a simple 2-bit counter (similar to one proposed
in~\cite{cache-decay-2001}) to keep track of the inter-write times of all the cache blocks and a small buffer
to temporarily store the blocks whose inter-write time has exceeded the retention time. We conduct
execution-driven analysis of our proposed techniques using the M5 simulator with
a suite of PARSEC 2.1 and SPEC 2K6 benchmarks.

The main {\bf contributions} of this paper can be summarized as: \\
\noindent\textbf{Detailed characterization of STT-RAM volatile property:} We present a detailed
device characterization of data-retention tunability in STT-RAM cells, providing insight to the
underlying principles enabling these tradeoffs.

\noindent\textbf{An application-driven study to determine retention time:}
Our application level characterization of the time between writes or replacements to a
LLC block augments the prior body of work that analyzes cache lifetimes mainly
in single processor and single program configurations. Based on the L2 cache behavior,
we propose to design STT-RAMs with retention time in the range of $10ms$.
%We believe that analyzing this broad spectrum of applications
%and architecting the whole cache hierarchy with single retention time STT-RAM
%cells is a non-trivial task.
Also, such a design
makes the LLC design homogeneous (all same type of STT-RAM cells) leading to lower die cost and ease in fabrication.

\noindent\textbf{Architectural solution to handle STT-RAM volatility:} We present a simple buffering
mechanism to ensure the integrity of the programs given the volatile nature of our tuned STT-RAM cells.
This scheme is simple, yet very energy and performance efficient compared to the DRAM-style refresh.
%This scheme is 26x less power consuming than DRAM-style refreshing proposed ~\cite{STTRAM:HPCA11}
%in addition to the 25x area and 22x performance benefits.
Experimental results with PARSEC 2.1 and SPEC 2K6 benchmarks on a four-core platform
compared to a base case 1MB SRAM per core and the ideal 4MB SRAM per core indicate that the proposed
solution is attractive both from performance and power perspectives. On an average, we find 18\% speedup
improvement with PARSEC 2.1 benchmarks, 10\%/4\% instruction throughput/weighted speedup improvement
with SPEC 2K6 benchmarks, and an average energy saving of 60\%, over 1MB SRAM across our entire application suite.


%\noindent\textbf{Comparison with prior architectural proposals:} We show that our scheme is better by
%XX\% when compared to a recent write latency hiding scheme proposed in~\cite{}. Additionally, all our
%results are with a write-buffer ...


The rest of this paper is as follows: In Section~\ref{sec:design}, we discuss the volatile STT-RAM design to
parameterize the retention time and write latency behavior. Section~\ref{sec:motivation} presents a retention time
estimation study from application perspective. Following this, the design of a volatile STT-RAM-based LLC architecture is given in Section~\ref{sec:implementation}; the experimental platform details in Section~\ref{sec:evaluation};
results in Section~\ref{sec:results}, and description of related work in Section~\ref{sec:prior_work}.
The last section provides concluding remarks.


%- Finally, we show that our combined device-architecture life time
%tuning approach is better than recent efforts that attempt to address the
%long write latencies of STT-RAM.

%cache lifetimes performed for a multi-threaded workload demonstrates
%that a significant fraction of L2 cache lines can operate correctly
%without any additional support when the STT-RAM retention times are of
%the order of 50ms. However, architectural support is required to ensure
%that correct program state is maintained for the rest of the cache lines
%that have lifetimes exceeding 50ms. While a simple
%DRAM-style refresh has been proposed in~\cite{STTRAM:HPCA11} to ensure
%correctness, it is possible to avoid many of these refresh by pursuing a
%life-time aware refresh strategy.
